{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Discussion 6\n",
    "\n",
    "In this discussion section we study 1) PCA from the latent variable perspective, and 2) $\\ell_1$ (LASSO) vs $\\ell_2$ (ridge) regularization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import sklearn\n",
    "from sklearn.decomposition import PCA\n",
    "from ipywidgets import interactive\n",
    "import ipywidgets as widgets\n",
    "from matplotlib import pyplot as plt\n",
    "from sklearn.linear_model import LinearRegression, Ridge, Lasso "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part (a): PCA from the Latent Variable Perspective ##"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this part, we study PCA from the latent variable perspective. \n",
    "\n",
    "Suppose the observed features $\\vec{X} \\in \\mathbb{R}^m$ orginates from latent features $\\vec{L} \\in \\mathbb{R}^{\\ell}$, where $\\ell \\leq m$. Running PCA on many samples of $\\vec{X}$ will recover the latent features.\n",
    "\n",
    "Specifically in this example, we assume the latent variable is uniformly sampled from an ellipse. Specifically, let $$\\vec{L'} = [2sin(\\theta), cos(\\theta))]^T \\in \\mathbb{R}^2$$ where $\\theta$ is sampled from the uniform distribution $U(0, 2\\pi)$. We then rotate $\\vec{L'}$ with a random rotation matrix to obtain $\\vec{L}$ .\n",
    "\n",
    "Observed $\\vec{X}$ follows a linear transformation of $\\vec{L}$ plus some iid noise: $$\\vec{X} = W\\vec{L} + \\vec{N}$$ where the noise $\\vec{N} \\sim \\mathcal{N}(\\vec{0}, \\sigma^2 I)$.\n",
    "\n",
    "We observe $n=1000$ data points of $\\vec{X}$. Let's generate the data first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_l(N):\n",
    "    theta = np.random.uniform(low=0.0, high=2*np.pi, size=N)\n",
    "    L = np.vstack((2*np.sin(theta), np.cos(theta))).transpose()\n",
    "    return L@orth_basis(2,2)\n",
    "def generate_X(L, dim_m, sigma_n = 0.5):\n",
    "    N, dim_l  = L.shape\n",
    "    noise = np.random.normal(0, sigma_n, (N, dim_m))\n",
    "    #W = np.random.rand(dim_l, dim_m)\n",
    "    W = orth_basis(dim_m, dim_l)\n",
    "    X = L@np.transpose(W) + noise\n",
    "    return X, X@W"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then let's fit a PCA model on $X$ and project it on the first $\\ell=2$ coordinates. Note we fix $l=2$ for the convenience of visualization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def gen_data_and_fit_PCA(dim_m, sigma_n, normalize=True):\n",
    "    ## generate data\n",
    "    dim_l = 2\n",
    "    N = 1000\n",
    "\n",
    "    L = generate_l(N)\n",
    "    X, X_hat = generate_X(L, dim_m, sigma_n)\n",
    "    ## fit PCA\n",
    "    pca = PCA(n_components=dim_l)\n",
    "    pca.fit(X)\n",
    "    L_hat = pca.fit_transform(X)\n",
    "    L_rand = random_project_data(X, dim_l)\n",
    "    if normalize:\n",
    "        L = normalize_l(L)\n",
    "        L_hat = normalize_l(L_hat)\n",
    "        L_rand = normalize_l(L_rand)\n",
    "        X_hat = normalize_l(X_hat)\n",
    "    return L, L_hat, L_rand, X_hat\n",
    "def orth_basis(dim, dim_l):\n",
    "    ## This function creates orthogonal basis from random projection\n",
    "    random_state = np.random\n",
    "    H = np.eye(dim)\n",
    "    D = np.ones((dim,))\n",
    "    for n in range(1, dim):\n",
    "        x = random_state.normal(size=(dim-n+1,))\n",
    "        D[n-1] = np.sign(x[0])\n",
    "        x[0] -= D[n-1]*np.sqrt((x*x).sum())\n",
    "        # Householder transformation\n",
    "        Hx = (np.eye(dim-n+1) - 2.*np.outer(x, x)/(x*x).sum())\n",
    "        mat = np.eye(dim)\n",
    "        mat[n-1:, n-1:] = Hx\n",
    "        H = np.dot(H, mat)\n",
    "        # Fix the last sign such that the determinant is 1\n",
    "    D[-1] = (-1)**(1-(dim % 2))*D.prod()\n",
    "    # Equivalent to np.dot(np.diag(D), H) but faster, apparently\n",
    "    H = (D*H.T).T\n",
    "    return H[:, :dim_l]\n",
    "def random_project_data(X, dim_l):\n",
    "    dim_m = X.shape[1]\n",
    "    W = orth_basis(dim_m, dim_l)\n",
    "    #W = np.random.rand(dim_m, dim_l)\n",
    "    return X@W"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's visualize the ground-truth latent variable $L$ and recovered $\\hat{L}$. Note we need to normalize them first to eliminate scaling issues."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "def normalize_l(L):\n",
    "    return L/np.linalg.norm(L, ord = 2,axis=0, keepdims=True)\n",
    "def plot_latent(L, L_hat, L_rand, X_hat):\n",
    "    fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4)\n",
    "    fig.set_size_inches(26.5, 6.5)\n",
    "    fig.suptitle('Latent Variable: Clean Ground-Truth (1st) vs Noisy Ground-Truth (2nd) vs PCA Recovered (3rd) vs Random Projection (4th)', fontsize=25)\n",
    "    ax1.plot(L[:, 0], L[:,1], \".\")\n",
    "    ax2.plot(X_hat[:, 0], X_hat[:,1], \".\")\n",
    "    ax3.plot(L_hat[:, 0], L_hat[:,1], \".\")\n",
    "    ax4.plot(L_rand[:, 0], L_rand[:,1], \".\")\n",
    "def generate_dim_m_widget():\n",
    "    return widgets.IntSlider(\n",
    "        value=50,\n",
    "        min=5, \n",
    "        max=100, \n",
    "        step=5,\n",
    "        description='dim_m',\n",
    "        continuous_update= False)\n",
    "def generate_sigma_n_widget():\n",
    "    return widgets.FloatSlider(\n",
    "        value=0.3,\n",
    "        min=0.01, \n",
    "        max=1, \n",
    "        step=0.01,\n",
    "        description='noise level',\n",
    "        continuous_update= False)\n",
    "def visualize(dim_m, sigma_n):\n",
    "    L, L_hat, L_rand, X_hat = gen_data_and_fit_PCA(dim_m, sigma_n)\n",
    "    plot_latent(L, L_hat, L_rand, X_hat)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "9b89b3d565384b848725c827e3acf17b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "interactive(children=(IntSlider(value=50, continuous_update=False, description='dim_m', min=5, step=5), FloatS…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "interactive_plot = interactive(visualize,\n",
    "                               dim_m=generate_dim_m_widget(),\n",
    "                               sigma_n=generate_sigma_n_widget()  \n",
    "                              )\n",
    "interactive_plot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. **Change $m$ (dim_m), what do you observe for the difference between ground-truth and recovered latent variable?**\n",
    "2. **Change noise level, what do you observe for small and big noise value?**\n",
    "3. **Why the recovered latent variable by PCA is aligned with x,y axis?**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part (b): $\\ell_1$ (LASSO) vs $\\ell_2$ (ridge) Regularization ##"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this part, we compare LASSO and ridge regression. The assumption is a bit different from the previous part: For data $\\vec{X} \\in \\mathbb{R}^m$ and labels $\\vec{y}$, $\\vec{y}$ comes from latent variable $\\vec{L} \\in \\mathbb{R}^l, l \\leq m$, while $\\vec{X}$ comes from $L$ and irrelevant feature $\\vec{R} \\in \\mathbb{R}^{m-l}$. We aim to use LASSO and ridge regression to recover the latent variable $L$.\n",
    "\n",
    "Specifically, let $m$ = 20 and number of sampled data points $N = 1000$. Assume $\\vec{L} \\sim \\mathcal{N}(0, 0.5), \\vec{R} \\sim \\mathcal{N}(0, 0.5)$. Data $X$ is a combination of $\\vec{L}, \\vec{R}$ and noise $\\vec{n_x} \\sim \\mathcal{N}(0, 0.1)$: $$X = [\\vec{L}, \\vec{R}]^T + \\vec{n_x}$$ \n",
    "Note that we set first $l$ dimension of $\\vec{X}$ to be from the latent variable simplicity of visualization.\n",
    "\n",
    "Labels $\\vec{y}$ is a linear transformation of $\\vec{L}$ plus noise $\\vec{n_y} \\sim \\mathcal{N}(0, 0.1)$: $$y = W\\vec{L} + \\vec{n_y}$$\n",
    "\n",
    "We will run LASSO and ridge regression on $(\\vec{X}, \\vec{y})$ to compare their performance on recovering the coefficient.\n",
    "\n",
    "Let's generate the data first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "def gen_data(dim_l, dim_m, N = 10, sigma_l = 0.5, sigma_m = 0.5, sigma_n = 0.05):\n",
    "    dim_y = 1\n",
    "    \n",
    "    noise_x = np.random.normal(0, sigma_n, (N, dim_m))\n",
    "    noise_y = np.random.normal(1, sigma_n, (N, dim_y))\n",
    "    \n",
    "    L = np.random.normal(0, sigma_l, (N, dim_l)) #generate_l(N)\n",
    "    R = np.random.normal(0, sigma_m, (N, dim_m - dim_l))\n",
    "    \n",
    "    X = np.hstack((L, R)) + noise_x\n",
    "\n",
    "    W = np.random.rand(dim_l, dim_y)\n",
    "    y = L@W + noise_y\n",
    "    return X, y, W"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To evaluate how well the method perform on recovering the weight, we calculate $\\ell_2$ norm between learned coefficient and ground-truth weight.\n",
    "\n",
    "Let's implement the function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pad_coef_gt(coef, coef_gt):\n",
    "    coef, coef_gt = np.squeeze(coef), np.squeeze(coef_gt)\n",
    "    coef_gt = np.squeeze(coef_gt)\n",
    "    coef_gt_ = np.zeros(coef.shape)\n",
    "    coef_gt_[:coef_gt.shape[0]] = coef_gt\n",
    "    return coef_gt_\n",
    "def diff(coef, coef_gt):\n",
    "    coef_gt = pad_coef_gt(coef, coef_gt)\n",
    "    coef, coef_gt = np.squeeze(coef), np.squeeze(coef_gt)\n",
    "    return np.linalg.norm(coef - coef_gt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another model we consider is debiased LASSO. Basically, LASSO can be considered as a dimension reduction/selection model, as irrelevant feature weights are set to 0. We can use LASSO to first select non-zero features, and then perform ordinary least squares (OLS) on the dimension-reduced features and corresponding labels. We refer to it as <em>debiased LASSO</em>.\n",
    "\n",
    "Let's implement the function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "def debiased_lasso(lassocoef, X, y, epsilon=1e-3):\n",
    "    ind = np.where(np.abs(lassocoef)>epsilon)[0]\n",
    "    OLSModel = Ridge(alpha = 0)\n",
    "    OLSModel.fit(X[:, ind], y)\n",
    "    \n",
    "    #pad coef\n",
    "    coef_ = np.zeros(X.shape[1])\n",
    "    coef_[ind] = OLSModel.coef_\n",
    "    \n",
    "    return coef_ #OLSModel.coef_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's run LASSO, ridge regression and debiased LASSO on $(\\vec{X}, \\vec{y})$ and plot coefficients of the ground truth and different models. We will also calculate difference (in terms of $\\ell_2$ norm) between weights of both models in detecting the latent dimensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_coef(W, ridgecoef, lassocoef, dlasso_coef):\n",
    "    ridgecoef = np.squeeze(ridgecoef)\n",
    "    lassocoef = np.squeeze(lassocoef)\n",
    "    \n",
    "    fig, (ax1, ax2) = plt.subplots(1, 2)\n",
    "    fig.set_size_inches(13.5, 3.5)\n",
    "    fig.suptitle('Coefficient', fontsize=25)\n",
    "    #ax1.xticks(np.arange(0, len(ridgecoef), step=1))\n",
    "    ax1.plot(np.arange(len(W)), np.squeeze(W), \"r^\", label=\"Ground truth\")\n",
    "    ax1.plot(np.arange(len(ridgecoef)), np.squeeze(ridgecoef), \"bx\", label=\"Ridge\")\n",
    "    ax1.plot(np.arange(len(lassocoef)), np.squeeze(lassocoef), \"g.\", label=\"LASSO\")\n",
    "    ax1.legend()\n",
    "    \n",
    "    ax2.plot(np.arange(len(W)), np.squeeze(W), \"r^\", label=\"Ground truth\")\n",
    "    ax2.plot(np.arange(len(lassocoef)), np.squeeze(lassocoef), \"g.\", label=\"LASSO\")\n",
    "    ax2.plot(np.arange(len(dlasso_coef)), np.squeeze(dlasso_coef), \"m*\", label=\"Debiased LASSO\")\n",
    "    ax2.legend()\n",
    "    #ax1.show()\n",
    "    \n",
    "def generate_dim_l_widget():\n",
    "    return widgets.IntSlider(\n",
    "        value=2,\n",
    "        min=2, \n",
    "        max=19, \n",
    "        step=1,\n",
    "        description='dim_l',\n",
    "        continuous_update= False)\n",
    "def generate_dim_m_widget():\n",
    "    return widgets.IntSlider(\n",
    "        value=40,\n",
    "        min=10, \n",
    "        max=200, \n",
    "        step=5,\n",
    "        description='dim_m',\n",
    "        continuous_update= False)\n",
    "def generate_N_widget():\n",
    "    return widgets.IntSlider(\n",
    "        value=30,\n",
    "        min=10, \n",
    "        max=200, \n",
    "        step=5,\n",
    "        description='N',\n",
    "        continuous_update= False)\n",
    "def generate_sigma_n_widget():\n",
    "    return widgets.FloatSlider(\n",
    "        value=0.05,\n",
    "        min=0.01, \n",
    "        max=0.5, \n",
    "        step=0.01,\n",
    "        description='noise level',\n",
    "        continuous_update= False)\n",
    "def generate_lassoweight_widget():\n",
    "    return widgets.FloatSlider(\n",
    "        value=0.01,\n",
    "        min=0.01, \n",
    "        max=0.1, \n",
    "        step=0.01,\n",
    "        description='Lasso weight',\n",
    "        continuous_update= False)\n",
    "\n",
    "def visualize_ridge_lasso(dim_l,dim_m, N, sigma_n, lassoweight=0.01):\n",
    "    X, y, W = gen_data(dim_l, dim_m, N, sigma_n = sigma_n)\n",
    "    lassoModel = Lasso(alpha = lassoweight) \n",
    "    lassoModel.fit(X, y) \n",
    "\n",
    "    ridgeModel = Ridge(alpha = 0.01) \n",
    "    ridgeModel.fit(X, y)\n",
    "    \n",
    "    dlasso_coef = debiased_lasso(lassoModel.coef_, X, y)\n",
    "\n",
    "    W = pad_coef_gt(ridgeModel.coef_, W)\n",
    "    \n",
    "    GT = np.zeros(dim_m)\n",
    "    GT[range(dim_l)] = np.array([1]*dim_l)\n",
    "\n",
    "    plot_coef(W, ridgeModel.coef_, lassoModel.coef_, dlasso_coef)\n",
    "    print(\"l2 norm difference:\")\n",
    "    print(\"Ridge: \", diff(ridgeModel.coef_, W ) )\n",
    "    print(\"LASSO:\", diff(lassoModel.coef_, W ) )\n",
    "    print(\"Debiased LASSO:\", diff(dlasso_coef, W ) )\n",
    "    \n",
    "    #print(\"Ridge: \", FAR(ridgeModel.coef_, GT ) )\n",
    "    #print(\"LASSO:\", FAR(lassoModel.coef_, GT ) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "7bbd34edd0e445c78d741b03fd547109",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "interactive(children=(IntSlider(value=2, continuous_update=False, description='dim_l', max=19, min=2), IntSlid…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "interactive_plot = interactive(visualize_ridge_lasso,\n",
    "                               dim_l=generate_dim_l_widget(),\n",
    "                               dim_m=generate_dim_m_widget(),\n",
    "                               N = generate_N_widget(),\n",
    "                               sigma_n=generate_sigma_n_widget(),\n",
    "                               lassoweight=generate_lassoweight_widget()\n",
    "                              )\n",
    "interactive_plot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Look at the left figure:**\n",
    "1. **Change dimension of latent variable $L$ (dim_l) and compare LASSO and ridge. Which model's learned coefficient is closer to the ground truth?**\n",
    "2. **Change dimension of $M$ (dim_m), number of points $N$ and noise level $\\sigma(n_x)$. What do you observe? (Please only change one variable at a time)**\n",
    "3. **Change lasso weight $\\lambda$ and find the best hyperparameter.**\n",
    "**Look at the right figure:**\n",
    "\n",
    "4. **Compare LASSO and debiased LASSO. Discuss which model performs better under specific circumstances (e.g. when feature dimension is larger than # data points, when there is much noise, etc.).**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Congrats! You have finished the notebook part. \n",
    "\n",
    "To sum up, we study PCA from latent variable's perspective and compare LASSO and ridge regression.\n",
    "\n",
    "As you notice, PCA can be considered as an <em>unsupervised learning</em> method as it works on data $X$ and reduce its dimension; while ridge regression can be considered as a <em>supervised learning</em> method as it requires both data $X$ and label $y$. Is there any connection between PCA and ridge regression? As we show for the debiased LASSO, $\\ell_1$ regularization can also be considered as a dimension reduction tool. Can PCA performs a similar job?\n",
    "\n",
    "Please return to the worksheet to continue."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
